Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377
Conversation
|
Claude finished @dbernstein's task in 3m 46s —— View job SummaryThe three-layer approach (Redis lock, jitter, pick-oldest) is well-reasoned and the implementation is generally clean. The addition of the Details
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3377 +/- ##
=======================================
Coverage 93.37% 93.37%
=======================================
Files 503 503
Lines 46109 46126 +17
Branches 6321 6323 +2
=======================================
+ Hits 43054 43070 +16
- Misses 1979 1981 +2
+ Partials 1076 1075 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Reopening — the Redis lock is still needed for within-group serialization. Will layer pick-oldest and jitter on top to handle the cross-shard case. |
59955fd to
67ed73a
Compare
1250a55 to
07ddf48
Compare
tdilauro
left a comment
There was a problem hiding this comment.
This looks good! 🎸
One significant question and a minor suggestion below, but I'm approving it now, in case my question has a happier answer than I expected.
|
|
||
| # How long (seconds) to wait for the folder-creation lock before proceeding without it. | ||
| # Typed as float | int to match acquire_blocking's signature; tests monkeypatch this | ||
| # to a small float (e.g. 0.1) to avoid waiting a full 60 seconds. |
There was a problem hiding this comment.
Minor - probably better to put the test comment in the test file, rather than here, since they might diverge.
| ], | ||
| lock_timeout=timedelta(minutes=5), | ||
| ) | ||
| lock_acquired = folder_lock.acquire_blocking( |
There was a problem hiding this comment.
The PR description mentions that failure to acquire the lock should result in graceful degradation.
Question: Will RedisLock.acquire raise if, for example, Redis is down or not responding for some reason?
If an exception is thrown here, it won't be handled and we won't end up proceeding to the lock-less fallback below.
There was a problem hiding this comment.
You're right that is a problem: I will fix it. If lock acquisition fails due to an issue with Redis, we should just carry on since the purpose of the lock is to reduce the probability of generating duplicate directories not eliminate it entirely.
…(PP-XXXX) Multiple Celery workers running generate_playtime_report concurrently all call create_nested_folders_if_not_exist at the same time. Google Drive allows same-named folders and its list-API has indexing latency, so all workers see "folder not found" and each creates a duplicate. A distributed Redis lock scoped per (root_folder_id, data_source_name) serialises the check-and-create step. If a worker cannot acquire the lock within FOLDER_LOCK_ACQUIRE_TIMEOUT seconds it logs a warning and proceeds without the lock (graceful degradation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aces Three independent layers now guard against duplicate folder creation: 1. Redis lock (existing) — serialises workers within the same Redis instance (i.e. the same CM group), eliminating intra-group races entirely. 2. Random jitter (new) — each worker sleeps random.uniform(0, 30) seconds before acquiring the lock, spreading independent CM groups in time and reducing the probability that two groups race to create the same folder. 3. Pick-oldest (new) — get_file now passes orderBy=createdTime to files.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every worker converges on the same canonical (oldest) folder for all subsequent levels and the file upload. Orphaned empty duplicates may exist but no files are misplaced. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e loop - Change FOLDER_LOCK_ACQUIRE_TIMEOUT type from int to float | int so it matches acquire_blocking's signature and tests can monkeypatch it to a small float (e.g. 0.1) without mypy complaints. - Reduce FOLDER_CREATION_JITTER_MAX from 30 to 10 seconds. - Move the single jitter sleep before the data-source loop so the delay is incurred once per task invocation rather than once per data source, keeping per-data-source overhead constant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If Redis is unavailable (connection error, timeout, etc.) acquire_blocking raises rather than returning False, bypassing the fallback path entirely. Catching the exception and treating it as lock_acquired=False ensures the task proceeds without the lock — logging a warning — instead of aborting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5d02f4b to
436a210
Compare
Description
Three independent layers guard against duplicate Google Drive folder creation in
generate_playtime_report:Redis lock — a
RedisLockscoped per(root_folder_id, data_source_name)serialises workers within the same Redis instance (i.e. the same shard). If the lock cannot be acquired withinFOLDER_LOCK_ACQUIRE_TIMEOUT(60 s), the worker logs a warning and proceeds without it (graceful degradation — no task stalling).Random jitter — each task invocation sleeps
random.uniform(0, FOLDER_CREATION_JITTER_MAX)(up to 10 s) once before the data-source loop. Independent shards share no Redis instance, so the lock alone cannot coordinate them; a single random offset per run spreads their arrival times across all folder levels without multiplying the delay by the number of data sources.Pick-oldest —
get_filepassesorderBy=createdTimetofiles.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every subsequent worker converges on the same canonical (oldest) folder for all levels of the hierarchy and the file upload. Orphaned empty duplicates may exist but no files are misplaced.Residual risk: The three layers together make the collision window very narrow but do not provide a 100% guarantee. A fully watertight fix would require a coordination mechanism shared across all shards (e.g. storing canonical folder IDs in a shared database). We have chosen to accept this low-stakes residual risk to keep the code simple: the task runs monthly, any orphaned folders are cosmetic (no data loss), and the previous behaviour created duplicates on every concurrent run.
Motivation and Context
Multiple CMs grouped in a shard all write reports to the same Google Drive root. When
generate_playtime_reportfires at the same scheduled time, all workers check for the folder hierarchy simultaneously. Because Google Drive allows same-named folders and itsfiles.listAPI is eventually consistent, every worker sees "folder not found" and creates its own copy — producing duplicate same-named directories (e.g. multiple "BiblioBoard" folders). Different CM groups share no Redis instance, so a lock alone is insufficient.JIRA: PP-4434
How Has This Been Tested?
redis_fixtureadded totest_generate_playtime_reportsso the task's Redis call resolves to a real client;FOLDER_CREATION_JITTER_MAXmonkeypatched to 0 to keep the test fast.test_generate_playtime_report_folder_lock_contention: pre-acquires the folder lock, patchesFOLDER_LOCK_ACQUIRE_TIMEOUTto 0.1 s and jitter to 0, asserts the warning is logged and the file upload still completes (graceful degradation).test_get_file_orders_by_created_time: verifiesorderBy=createdTimeis present in the Drive API request URI and that the first (oldest) result is returned when multiple exist.Checklist